Modeling US Flight Delays

Ding Ding Ding

Author

Aymen Lamsahel, Ben Caterine, Haneef Usmani

Published

March 14, 2023

Abstract
The goal of this study is to develop an inferential model of US flight delay times using a 2015 dataset of over four million flights, supplemented by a daily temperature dataset. The resulting model was developed using variable transformations informed by EDA insights, such as adding a logarithmic term for flight distance, and was improved using variable selection techniques like forward selection and coefficient shrinkage methods like lasso regression. The model's equation was then used to infer which factors are associated with delays.

0.1 Background / Motivation

The US aviation industry transports millions of Americans every year, providing near-essential travel all across the country. Over the past few years, we have seen how precarious this industry can be with the COVID-19 pandemic, which saw almost all flights suspended for an extended period. Then, this past December and January, Southwest Airlines was plagued by issues in their decades-old computer systems which caused a massive meltdown of the entire airline’s scheduling procedures, causing thousands of flight cancellations [1].

These catastrophic problems have led many to question the structural longevity of the aviation industry. To this end, we wanted to look at probably the most widespread side-effect of industry problems: flight delays. Flights can be delayed due to a variety of factors: weather, airline problems (like Southwest this winter), airport-specific problems like runway/tarmac congestion, and more. In this project, we aim to use a dataset of over 4 million domestic flights from 2015 to infer which factors of a flight are associated with flight delays.

0.2 Problem statement

Our project aims to analyze flight delay data and identify the predictors that have the greatest impact on flight delays. We will explore the relationship between various predictors, such as weather conditions, flight distance, airline carrier, and time of day, and how they affect flight delays. The goal is to provide insights to airlines and airports on how to reduce flight delays and improve the travel experience for passengers. We will approach this problem as an inference task, focusing on understanding the relationships between the predictors and the response variable. Specifically, we will use regression analysis to model the relationship between flight delay time and the various predictors. Ultimately, the insights gained from this project will have the potential to reduce the negative impact of flight delays on individuals, airlines, and the economy as a whole.

0.3 Data sources

Flight Data: 2015 Flight Delays and Cancellations

Our flight data comes from the US Department of Transportation's Bureau of Transportation Statistics, published on Kaggle. It contains information for US domestic flights in 2015. This was the most comprehensive dataset available in terms of the amount of information per flight, so we elected to use it even though it is somewhat older, predating recent societal changes (COVID) that have reshaped the aviation industry. The downside of using data from 2015 is that it prevents us from conducting a true prediction study; flight data from 2015 alone cannot be used to predict flight delays in 2023. However, it stands to reason that factors associated with delays in 2015 are mostly still associated with delays in 2023, so we can use the data to perform an inference study of these variables.

This dataset contains three separate files:

  • airlines.csv: Contains the IATA code for each airline in the dataset (e.g., Southwest Airlines = WN).

  • airports.csv: Contains the IATA code and location for each airport, including latitude and longitude (e.g., Washington Dulles Airport is code IAD, in Chantilly, VA).

  • flights.csv: Contains information for each flight in the dataset, including the date, airline, flight number, origin and destination airports, scheduled and actual departure and arrival times, and the response variable, departure delay.

Temperature Data: Daily Temperature of Major Cities

As discussed in the Data Cleaning section below, we decided that weather was another important factor in flight delays, so we wanted to incorporate it. The best dataset we could find which had weather data for most airport cities on most days in 2015 was this temperature dataset which comes from the University of Dayton and was published by SRK on Kaggle.

This dataset contains daily average temperatures for major world cities from 1995 to 2020. Included in the list are around 200 US cities, which encompass the locations of most airports in the flights dataset.

0.4 Stakeholders

Passengers: US airline passengers are the primary stakeholders because they are the ones most impacted by flight delays and cancellations. Late arrivals and missed connections disrupt passengers’ lives, and our model will help passengers avoid booking flights that are likely to be majorly delayed.

Airlines: The airlines also have an interest in our project because they schedule flights and stand to lose money if flights are delayed or canceled. By using our model, they can identify certain flight dates/times/locations/types to avoid scheduling in order to keep their flights on time. They can also potentially use our model to evaluate their performance relative to other airlines.

Government agencies (DOT, FAA): Government agencies like the FAA would be stakeholders because they monitor all US flights and can play a role in helping to mitigate delays. They can regulate certain flight attributes that tend to lead to delays, such as plane type or departure time, and they can reprimand airlines or airports whose flights are most frequently late.

0.5 Data quality check / cleaning / preparation


The frequency ratio between the most and least frequent values is 665.0 for origin_airport and 692.0 for destination_airport. Based on the statistics and p-values derived from the t-tests and Mann-Whitney U tests, and on the frequency distributions of the airports across the samples and the overall dataset, the first sample, flights_sample1, appears to be the safest and most representative training set.

Check the Appendix for the distribution of our categorical and continuous predictors!

0.5.1 Data Cleaning and Wrangling

To prepare our data for modeling, the following steps were taken:

  1. Read in the airports and flights datasets. The flights dataset is the one from which we get our predictors and response variable.

  2. Merge the airports and flights datasets. The purpose of this is to obtain location information for both the origin and destination of each flight. Therefore, flights was merged twice with airports, once on the origin airport and once on the destination, to get latitude and longitude for each.

  3. Remove unnecessary columns from the flights dataset. The specific reason for each removal is explained in the code, but the reasons mainly fall into two categories:

    • The column contains information irrelevant to predicting flight delays. These are columns like flight number and airplane tail number, which are arbitrary and have no bearing on the length of the departure delay.

    • The column is obviously correlated with another column (multicollinearity) or, in some cases, is a direct mathematical derivation of other columns.

    For example, the arrival_delay of a flight is obviously correlated with the departure_delay, since a flight that leaves late will usually arrive late. Additionally, arrival delay cannot be used as a predictor of departure delay because it occurs after departure. Another example is departure_time, which is merely scheduled_departure + departure_delay. Since the latter two variables are included in the data, departure_time is redundant and should be excluded.

  4. Convert time columns to minutes since midnight. Some time columns, such as scheduled_departure, were listed in a clock format, i.e., a 7:59 am departure would be listed as 759 but an 8:00 am departure as 800. To create a continuous variable, these times were converted to the number of minutes since midnight. For example, 7:59 am becomes 479 and 8:00 am becomes 480.

  5. Add a day_of_year column as a continuous measure of date, rather than the existing month and day columns, which are discrete. For example, February 1st would be day 32.

  6. Upon conducting some early EDA and base modeling, our team realized that a likely important predictor of flight delays was missing entirely from our dataset: weather. To remedy this, we brought in the previously mentioned temperature dataset, which contains daily average temperatures for most cities in our flights dataset. An initial search found that around 70% of all flights matched cities in the weather dataset. However, some airports' listed locations in flights did not match the city listed in temperature. For example, Washington Dulles airport is listed as Chantilly, VA in flights, but this small locale is not in temperature, so Washington DC was renamed to Chantilly in temperature to allow a match between the datasets. After renaming, over 85% of all flights were matched to a temperature, and the two datasets were merged. The merge process was similar to the previous one: flights was merged twice with temperatures, for both origin and destination airports.

  7. A few null values were present in some columns. Because of the large size of the dataset, rows containing null values were dropped.
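Several of the wrangling steps above can be sketched in pandas on a toy frame. The column names follow the Kaggle files, but the values here are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-ins for flights.csv and airports.csv (illustrative values).
flights = pd.DataFrame({
    "origin_airport": ["IAD", "ORD"],
    "destination_airport": ["ORD", "LAX"],
    "scheduled_departure": [759, 800],   # clock format: 7:59 am, 8:00 am
    "month": [2, 2],
    "day": [1, 2],
    "departure_delay": [5.0, None],
})
airports = pd.DataFrame({
    "iata_code": ["IAD", "ORD", "LAX"],
    "latitude": [38.94, 41.98, 33.94],
    "longitude": [-77.46, -87.90, -118.41],
})

# Step 2: merge airports in twice, once for origin and once for destination.
flights = flights.merge(
    airports.add_prefix("origin_"),
    left_on="origin_airport", right_on="origin_iata_code",
).drop(columns="origin_iata_code")
flights = flights.merge(
    airports.add_prefix("destination_"),
    left_on="destination_airport", right_on="destination_iata_code",
).drop(columns="destination_iata_code")

# Step 4: convert clock-format times to minutes since midnight (759 -> 479).
hhmm = flights["scheduled_departure"]
flights["scheduled_departure"] = (hhmm // 100) * 60 + (hhmm % 100)

# Step 5: continuous day_of_year (February 1st -> day 32).
dates = pd.to_datetime(dict(year=2015, month=flights["month"], day=flights["day"]))
flights["day_of_year"] = dates.dt.dayofyear

# Step 7: drop rows containing null values.
flights = flights.dropna()
print(flights[["scheduled_departure", "day_of_year", "origin_latitude"]])
```

The second toy row is dropped by `dropna()` because its departure_delay is null, mirroring step 7.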

0.5.2 Data Preparation

After cleaning, the resulting dataset was written out to data/flights_clean.csv for use in modeling and EDA.

An additional dummy variable dataset was written to data/flights_clean_numerical.csv. However, this file was around 3.5 GB and over 150 columns, so it was too large to use for the variable selection and shrinkage methods below.

After attempting to use the dummy variable dataset, we went back and wrote out an additional cleaned dataset, data/flights_clean_numerical_significant.csv. This dataset contains all numerical columns as well as dummy variables for categorical levels which were significant in the base model.

0.6 Exploratory data analysis


Aymen

EDA: Given the statistics and p-values derived from both the t-tests and the Mann-Whitney U tests, and given our observation of the frequency distribution of the airports in each sample and in the overall dataset, the first sample, flights_sample1, may be the safest and most representative training set of all.

EDA: After filtering out correlation values below 0.45, only a few pairs of variables have a significant degree of correlation with each other, indicating potential multicollinearity between them.

EDA: The most important row of the pairplot is the one showing the relationships of all the variables against departure_delay. At a glance, some variables appear to have a significant relationship with the response that could be used to advantage in our ultimate inference model. Certain predictor variables also have significant relationships with other predictors; most, if not all, of these can be ignored, since they are expected to show such a trend, e.g., day_of_year and month.

EDA: Based on the Gaussian and non-Gaussian kernel density estimate plots of the continuous predictors against departure_delay, fixed-width binning does not seem necessary. The plots do show a high concentration of points within certain sections exhibiting an apparently non-linear (e.g., logarithmic) trend, which is deceiving; that concentration may pull the regression trend toward a more linear fit than a non-KDE plot would suggest.

EDA: Days 274 to 304 are missing from the flights data we compiled, for an unknown reason. The data may have been corrupted and hence discarded by the entity that produced and published it.

EDA: Based on the autocorrelation method we implemented, a 4-day lag proved relatively the most useful, though its associated correlation value of 0.17 indicates a very low degree of time-based correlation.

0.7 Approach

What kind of a model (linear / logistic / other) did you use? What performance metric(s) did you optimize and why?

We used a linear model because our response variable, departure delay, is continuous. We tried to optimize performance metrics such as AIC, BIC, and especially RMSE. We prioritized RMSE over other metrics because it weights large errors heavily, which matters for our regression: airlines and passengers require accurate and precise delay estimates to manage travel schedules effectively, so large misses are especially costly. RMSE measures the typical difference between predicted and actual delay times in the same units as the target variable, allowing the expected deviation in flight delay times to be quantified directly. In contrast, MAE treats all errors equally, and R-squared measures the proportion of variation explained rather than the magnitude of error, so neither gives as clear a picture of error size for flight delay modeling. RMSE is therefore the most informative metric for decisions about flight schedules and alternative plans.
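As a small illustration of why RMSE and MAE can disagree, consider one 80-minute miss among otherwise accurate toy predictions (values are illustrative only, not from our model):

```python
import numpy as np

# Toy delay values in minutes: three near-misses and one 80-minute miss.
actual = np.array([0.0, 5.0, 10.0, 120.0])
predicted = np.array([2.0, 5.0, 15.0, 40.0])

errors = actual - predicted
mae = np.mean(np.abs(errors))         # treats all errors equally
rmse = np.sqrt(np.mean(errors ** 2))  # penalizes the 80-minute miss heavily

print(f"MAE  = {mae:.2f} minutes")    # -> 21.75
print(f"RMSE = {rmse:.2f} minutes")   # -> 40.09
```

The single large error dominates RMSE but is averaged away in MAE, which is exactly why RMSE suits a setting where big delay misses are the costly ones.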

Is there anything unorthodox / new in your approach?

Since we had a very large dataset (~5 million rows), we were not able to run certain variable selection methods without reducing the sample size or the number of predictors due to computational limitations. One of our workarounds was to include all numerical predictors, along with the significant categorical predictors from the base model, in our variable selection. Although this may not have resulted in the most optimal set of predictors, it allowed us to reduce the number of predictors while still retaining important information.

What problems did you anticipate? What problems did you encounter? Did the very first model you tried work?

  • Just from looking at the columns of the raw dataset, we knew that a lot of cleaning and data wrangling would need to be done first. Many of the columns were also clearly collinear (e.g., arrival_time is very strongly correlated with scheduled_arrival). After cleaning, we anticipated possibly not having enough relevant predictors to make strong inferences about flight delays, so we added daily temperature data, reasoning that extreme temperatures could cause more flight delays and would hopefully improve the accuracy of our analysis.

  • The biggest problem we encountered was the size of our dataset. We did not want to sacrifice too many raw observations and risk inaccurate inferences, so we worked around this by using only numerical data for our first round of variable selection and shrinkage methods. However, we wanted to utilize at least some of the categorical predictors, so through some EDA we decided to use significant categorical levels in an updated dataset with all the numerical predictors, as mentioned earlier.

  • The baseline model that we ran first did not produce great results, so we knew there was a lot of room for improvement. We were only able to run it a few times, since running it on the whole dataset took ~15 minutes. This baseline model yielded an R-squared of 0.050 and an RMSE of 37.4 minutes. For comparison, the standard deviation of the response variable was 37.8 minutes and the mean was 10.1 minutes.

Did your problem already have solution(s) (posted on Kaggle or elsewhere). If yes, then how did you build upon those solutions, what did you do differently? Is your model better as compared to those solutions in terms of prediction / inference?

No, our problem did not already have solutions posted; we built our model from scratch.

0.8 Developing the model

0.8.1 Variable Selection

0.8.1.1 Best Subset Selection:

One of our main methods of improving the base model was variable selection. We began with best subset selection, which we quickly found to be computationally infeasible, so we first ran it on only a small sample of the data (see appendix):

We also ran our initial variable selection only on the numerical variables, removing all categorical variables (see appendix):

Once the best subset function finished (it took almost 30 minutes to run on a sample of only 10,000 rows), the recommended model differed depending on the metric (see appendix):

As shown above, based on adjusted R-squared and AIC, the optimal number of predictors is 12. We decided to optimize AIC over BIC, since AIC places a smaller penalty on models with more variables, which made sense here because we had a large predictor set and our baseline model with all predictors only yielded an R-squared of 0.02. With this best subset model of 12 predictors, the R-squared was 0.024.

Since we were running on such a small sample (only 0.2% of the entire dataset), we could not really use these models for reliable inference, so we moved on to forward selection, which was well suited due to its relative computational efficiency.

0.8.1.2 Forward Selection:

Like with best subset selection, the first time we ran forward selection on the dataset, we computed it only on the numerical predictors, as using the categorical predictors was not computationally possible; we could not even properly read the full "dummies" version of the dataset, let alone run any functions on it. Running forward selection took only 4.5 minutes, displaying its computational advantage over other variable selection methods. We found that using all 16 numerical predictors yielded the best model, a result consistent across AIC, BIC, and adjusted R-squared (see appendix):

Here is the model equation (see appendix):

Even though our model's R-squared was the same as the baseline model's, the new model with 16 predictors had a small decrease in RMSE, from 37.4 to 37.3.

0.8.1.3 Backward Selection:

Next, we ran backward selection on the numerical dataset. Running the function took almost 7 minutes and yielded the following results (see appendix). As shown, backward selection produced very similar results to forward selection, essentially telling us to use all the numerical predictors. This makes sense: backward selection starts with a full model and forward selection with an empty one, and since both used the same predictors, it checks out that they would agree when the best model uses all of them.

Since the model was essentially the same as the first forward selection model, the RMSE was also 37.3.

0.8.1.4 Forward Selection: Updated Dataset

Due to the aforementioned issues regarding the numerical dataset, for the second round of forward selection we used an updated dataset containing the significant categorical levels from the baseline model (see appendix).

As shown above, using 23 predictors yielded the best model by AIC. In this new model with 23 predictors, the RMSE stayed essentially the same at 37.45.

Conclusions:

With this round of variable selection, we were not able to improve the baseline model by a substantial amount. Our biggest obstacle was working with such a large dataset while still utilizing it properly to make reasonable inferences. Ideally, we would have liked to run best subset selection on the full dummies dataset; however, that would be impossible for any normal computer (the dummies dataset has over 150 columns). Even with our workaround of including significant categorical levels from the baseline model, we were not able to improve the model by much, and we could only run the updated dataset fully through forward selection, as it was too big for backward selection, let alone best subset selection.

0.8.2 Shrinkage Methods

After completing variable selection, we used shrinkage methods to attempt to improve the performance of our model by shrinking coefficients.

Note: See the appendix for all associated graphs.

0.8.2.1 Ridge Regression

We first attempted ridge regression on only the numerical data in flights. We did this by dropping all categorical variables from the dataset. The following steps were taken to perform ridge regression:

  1. Split the data into train and test using train_test_split to evaluate performance later.

  2. Standardize the predictors X into Xstd using StandardScaler().

  3. Set the alphas space to consider when searching for the optimal lambda tuning parameter for the shrinkage penalty.

  4. Fit a regression for each possible lambda value and produce a coefficient vs. lambda graph:

  5. Use cross validation to find the optimal performing lambda value.
    lambda = 0.051

  6. Optimal lambda graphed with the coefficients:

  7. Standardize the test data (Xtest_std) for testing performance.

  8. Use the optimal model to predict the departure_delay for test data.

  9. Calculate the RMSE for the test data predictions to evaluate (and standard deviation of test data to compare):
    RMSE = 37.422
    STD = 37.787

  10. Calculate R-squared scores for train and test data:
    R^2 train = -48.058
    R^2 test = -48.122

Notably, this RMSE is virtually the same as the RMSE obtained without ridge regression. Therefore, we sought further improvements.
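The ridge procedure above can be sketched with scikit-learn on synthetic data; the real flights matrices are too large to reproduce here, so the data and coefficients below are made up:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in for the numerical flights predictors.
n = 1000
X = rng.normal(size=(n, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)

# Steps 1-2: split, then standardize (fitting the scaler on train only).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)  # step 7: reuse the same scaler for test

# Steps 3-6: search a lambda (alpha) grid with cross-validation.
alphas = np.logspace(-3, 3, 100)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_train_std, y_train)
print("optimal lambda:", ridge_cv.alpha_)

# Steps 8-9: predict on the test set and compare RMSE to the response SD.
pred = ridge_cv.predict(X_test_std)
rmse = np.sqrt(np.mean((y_test - pred) ** 2))
print(f"RMSE = {rmse:.3f}, STD = {y_test.std():.3f}")
```

Fitting the scaler on the training split only, then reusing it for the test split, avoids leaking test-set information into the standardization.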

After completing the first round of variable selection and shrinkage methods, we wanted to incorporate categorical variables into our modeling, and due to the aforementioned issues with our dummy variable dataset, we used the flights dataset with all numerical columns and significant categorical levels in order to do this. The same procedure was followed, with the following results:

Optimal lambda = 0.057

RMSE = 37.111
STD = 37.546

R^2 train = -41.227
R^2 test = -40.564

This round of ridge regression produced a slight improvement in model performance (RMSE) but not a large one. We decided to try lasso regression to compare its performance.

0.8.2.2 Lasso Regression

Unlike ridge regression, lasso will completely remove insignificant predictors (i.e., shrink their coefficients to exactly 0). Similar to ridge, we first performed lasso on only the numerical data to compare performance. The process was much the same as for ridge:

  1. Take a sample of the data (n=10000) on which to perform lasso regression. This was necessary because the lasso algorithm would not successfully run on all the data.
  2. Split the data into train and test using train_test_split to evaluate performance later.
  3. Standardize the predictors X into Xstd using StandardScaler().
  4. Set the alphas space to consider when searching for the optimal lambda tuning parameter for the shrinkage penalty.
  5. Fit a regression for each possible lambda value and produce a coefficient vs. lambda graph:
  6. Use cross validation to find the optimal performing lambda value.
    lambda = 0.581
  7. Optimal lambda graphed with the coefficients:
  8. Standardize the test data (Xtest_std) for testing performance.
  9. Use the optimal model to predict the departure_delay for test data.
  10. Calculate the RMSE for the test data predictions to evaluate (and the standard deviation of the test data to compare):
    RMSE = 41.092
    STD = 41.482

R-squared was not calculated because it went unused in evaluating the previous ridge models.

This first lasso attempt performed poorly, with a much worse RMSE than either ridge regression run, although the sample standard deviation was also higher. We then tried lasso again with the same modified dataset of numerical + select categorical data, under the same process. This time, however, we also added the following variable transformations based on our EDA to try to further improve the model:

Transformations:

  • log(distance)
  • log(scheduled_time)
  • log(taxi_in)
  • log(taxi_out)

Binned variables (added as dummy variables for each bin):

  • binned origin_latitude
  • binned origin_longitude
  • binned destination_latitude
  • binned destination_longitude
  • binned day_of_year
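A minimal sketch of these transformations in pandas, using illustrative column values and an arbitrary bin count of 4 (not the binning actually used in the report):

```python
import numpy as np
import pandas as pd

# Toy frame with a few of the transformed columns; values are made up.
df = pd.DataFrame({
    "distance": [200.0, 1500.0, 2500.0],
    "taxi_out": [10.0, 25.0, 40.0],
    "origin_latitude": [25.8, 41.98, 61.2],
    "day_of_year": [15, 160, 300],
})

# Logarithmic transformations from the EDA (log1p guards against zeros).
for col in ["distance", "taxi_out"]:
    df[f"log_{col}"] = np.log1p(df[col])

# Equal-width binning with pd.cut, then a dummy variable for each bin.
for col in ["origin_latitude", "day_of_year"]:
    bins = pd.cut(df[col], bins=4)
    df = df.join(pd.get_dummies(bins, prefix=col))

print([c for c in df.columns if c.startswith("day_of_year_(")])
```

`pd.cut` labels each bin with an interval such as `(46.5, 92.0]`, which is why the final model equation contains dummy names like `day_of_year_(46.5, 92.0]`.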

Following are the results of this model:

Optimal lambda = 0.068

RMSE = 34.526
STD = 35.094

By RMSE, this was our best-performing model.

The final model equation:

departure_delay =
  -0.280062 * day
  -0.410927 * day_of_week
  -1.408435 * day_of_year
  -0.476174 * destination_temperature
  +0.838358 * origin_latitude
  +0.592646 * origin_longitude
  -0.479043 * origin_temperature
  +1.295893 * scheduled_arrival
  +3.007562 * scheduled_departure
  +1.016395 * taxi_in
  +3.509715 * taxi_out
  -0.832837 * airline_AS
  +0.872424 * airline_NK
  +1.509344 * airline_UA
  +0.426376 * destination_airport_BTV
  -0.081166 * destination_airport_DTW
  -0.240464 * destination_airport_FNT
  +0.660775 * origin_airport_CMH
  -0.588643 * origin_airport_IAD
  -0.384632 * origin_airport_LNK
  -0.294619 * origin_airport_RIC
  -0.443162 * state_destination_MI
  +0.000075 * state_destination_VT
  +0.353779 * state_origin_NE
  +0.361776 * log_distance
  -0.622060 * log_taxi_in
  -2.651841 * log_taxi_out
  +0.316994 * destination_latitude_(32.192, 43.066]
  -0.806106 * destination_latitude_(43.066, 53.94]
  -0.121632 * destination_longitude_(-158.01, -136.019]
  -0.267983 * destination_longitude_(-114.116, -92.212]
  +0.239962 * destination_longitude_(-92.212, -70.309]
  +0.605814 * origin_latitude_(21.275, 32.192]
  -0.192634 * origin_latitude_(43.066, 53.94]
  -0.395085 * origin_latitude_(53.94, 64.814]
  +0.055742 * origin_longitude_(-158.01, -136.019]
  +0.338428 * origin_longitude_(-136.019, -114.116]
  -0.807627 * day_of_year_(0.636, 46.5]
  -0.805485 * day_of_year_(46.5, 92.0]
  +1.538830 * day_of_year_(137.5, 183.0]
  +1.396959 * day_of_year_(183.0, 228.5]
  -0.013491 * day_of_year_(228.5, 274.0]
  -0.678390 * day_of_year_(274.0, 319.5]

Did you succeed in achieving your goal, or did you fail? Why?

Ultimately, we cannot say that we achieved our goal. While we improved our model through the methods discussed above, the RMSE remained in the neighborhood of 35 to 40 minutes. Since the mean departure delay is 10 minutes, this RMSE means the model cannot reliably detect the lateness of a flight, and it inhibits our ability to infer insights from the data. We have still done so in the conclusions section to follow, but bear in mind that these insights are at the mercy of our poorly-performing model.

0.9 Limitations of the model with regard to inference / prediction

Our initial goal as a team, as previously discussed, was to develop a predictive model that could forecast delay times for flights in 2023. However, this became impossible when we settled on our 2015 dataset. Instead, our project is inference-based; we used our model equation and data analysis to identify factors that contribute to flight delays. In doing so, we reasoned that these factors remain fairly consistent over the years; however, this assumption is itself a limitation of our model. It may not be fair to assume that the factors leading to delays are consistent.

During COVID-19, flights were delayed or canceled for unique reasons, such as safety precautions, worker shortages, and lack of demand. This past winter, flights were again canceled for novel reasons: Southwest flights were grounded by an unexpected computer system failure. Neither of these events would have been hinted at by the inferences drawn from our model. Therefore, our ability to use 2015 data to infer delay patterns in 2023 is severely limited; the 2015 results should be used more as a starting point from which to launch further research into delays in 2023.

0.10 Conclusions and Recommendations to stakeholder(s)

What conclusions do you draw based on your model? If it is inference you may draw conclusions based on the coefficients, statistical significance of predictors / interactions, etc. If it is prediction, you may draw conclusions based on prediction accuracy, or other performance metrics.

How do you use those conclusions to come up with meaningful recommendations for stakeholders? The recommendations must be action-items for stakeholders that they can directly implement without any further analysis. Be as precise as possible. The stakeholder(s) are depending on you to come up with practically implementable recommendations, instead of having to think for themselves.

If your recommendations are not practically implementable by stakeholders, how will they help them? Is there some additional data / analysis / domain expertise you need to do to make the recommendations implementable?

Do the stakeholder(s) need to be aware about some limitations of your model? Is your model only good for one-time use, or is it possible to update your model at a certain frequency (based on recent data) to keep using it in the future? If it can be used in the future, then for how far into the future?

Since we are inferring the relationships between all the predictors in our model, we recommend that stakeholders look into developing localized regression models based on the geographical area and the airport they are trying to analyze. They should focus on addressing the tendency of certain airlines to have late departure times, and likewise on airports known for late departures. Our model was further complicated by many confounding variables, in part due to the aforementioned influences. Stakeholders should focus on creating policies, such as penalty-based policies, that give these airlines and states the necessary infrastructure and tools, and that motivate them to achieve earlier and more consistent departure times.

GitHub and individual contribution

https://github.com/bencaterine/ding-ding-ding

Individual contribution

Team member | Contributed aspects | Details | Number of GitHub commits
Ben Caterine | Data cleaning, ridge regression, lasso regression | Cleaned and wrangled data to prepare for modeling; shrinkage methods (ridge and lasso) to improve model performance | 28
Haneef Usmani | Best subset selection, forward selection, backward selection | Variable selection to improve model performance | 15
Aymen Lamsahel | Exploratory data analysis, further data preparation, autocorrelation | In-depth EDA of all continuous and categorical variables (scatter plots, Gaussian KDE plots, KDE-specific plots, boxplots of categorical variables, etc.), correlation analysis, frequency distributions, autocorrelation | 12

List the challenges you faced when collaborating with the team on GitHub. Are you comfortable using GitHub? Do you feel GitHub made collaboration easier? If not, then why? (Individual team members can put their opinion separately, if different from the rest of the team)

GitHub is not very useful for Jupyter notebook files. The differences between your version and the current version are frequently not shown due to the changes being too large, and git registers changes to files that you didn’t actually change but merely opened/ran a cell. Also, GitHub is not very easy to learn on-the-go for members who hadn’t used it before. We were lucky to have some members with GitHub experience; otherwise this would’ve been very difficult.

GitHub felt very overwhelming at first but has become more manageable. However, I feel that other collaboration tools would have been less time-consuming, especially at the scale of this project. With more than five people on a team, the advantages of GitHub would be clearer; but because much of our work was independent, using GitHub felt like an extra step in the process. Because of this, I would often wait until I had made all changes to a file before pushing them, especially early on, since I was unfamiliar with pushing and pulling.

References

[1] Karen Brooks Harper. "Southwest Airlines' holiday meltdown brings on federal investigation." The Texas Tribune, Dec. 27, 2022.

Appendix

0.10.1 Best Subset Selection

0.10.2 Forward Selection

0.10.3 Backward Selection

0.10.4 Updated Forward Selection

0.10.5 Ridge Regression

0.10.6 Updated Ridge Regression

0.10.7 Lasso Regression

0.10.8 Updated Lasso Regression
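The appendix code cells did not survive rendering. As a placeholder, here is a minimal sketch of the lasso step described in the abstract (coefficient shrinkage used for variable selection); it assumes scikit-learn and standardized predictors, and is not the authors' original code.

```python
# Minimal lasso sketch (illustrative, not the original appendix code):
# cross-validated lasso shrinks uninformative coefficients toward zero,
# effectively selecting variables.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))          # stand-ins for flight predictors
# only predictors 0 and 1 carry signal in this synthetic example
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_std = StandardScaler().fit_transform(X)   # lasso penalties need a common scale
lasso = LassoCV(cv=5).fit(X_std, y)
kept = np.flatnonzero(np.abs(lasso.coef_) > 1e-3)
print(kept)  # indices of predictors the lasso retained
```

The same pattern applies to the ridge sections, substituting `RidgeCV`, which shrinks coefficients without zeroing them out.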

1 Data Visualization and Statistical Analysis

1.0.1 Pairplot


1.0.2 Correlation Analysis

The flattened correlation matrix reported only the strongly correlated pairs (all other entries were NaN):

Predictor pair | Correlation
--- | ---
distance / scheduled_time | 0.98386
month / day_of_year | 0.996337
scheduled_arrival / scheduled_departure | 0.648755
origin_temperature / destination_temperature | 0.673695
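High-correlation pairs like those reported above can be extracted with pandas by taking the upper triangle of the correlation matrix and filtering on a threshold. This sketch assumes a numeric DataFrame like the study's; it is demonstrated on a small synthetic frame.

```python
# Extract predictor pairs whose absolute correlation exceeds a threshold.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
base = rng.normal(size=300)
flights = pd.DataFrame({
    "distance": base,
    "scheduled_time": base * 1.01 + rng.normal(scale=0.1, size=300),
    "taxi_out": rng.normal(size=300),
})

corr = flights.corr().abs()
# keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()            # Series indexed by (col_a, col_b)
print(pairs[pairs > 0.6])        # the highly correlated pairs
```

Pairs like distance/scheduled_time above signal multicollinearity, which is one reason shrinkage methods such as ridge and lasso were applied later.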

1.0.3 Analysis of Potential Variable Interactions and Transformations

1.0.3.1 Scatter Plots of Numeric Predictors Against Response Variable

Non-Numeric Predictors, which were not included in the above plots, are:  ['airline', 'destination_airport', 'origin_airport', 'state_destination', 'state_origin']

1.0.3.2 (Gaussian) Kernel Density Estimate Plots of Numeric Predictors Against departure_delay


Non-Numeric Predictors, which were not included in the above plots, are:  ['airline', 'destination_airport', 'origin_airport', 'state_destination', 'state_origin']
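The Gaussian KDE plots in this section estimate the density of each numeric predictor against departure delay. The underlying idea can be sketched with SciPy's `gaussian_kde` (this is an illustration on synthetic, right-skewed delay data, not the report's plotting code):

```python
# One-dimensional Gaussian kernel density estimate of departure delays.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
# synthetic right-skewed delays, roughly the shape real delay data takes
delays = rng.gamma(shape=2.0, scale=10.0, size=1000)

kde = gaussian_kde(delays)                       # bandwidth chosen automatically
grid = np.linspace(delays.min(), delays.max(), 200)
density = kde(grid)                              # estimated density on the grid
peak = grid[np.argmax(density)]                  # mode of the estimated density
print(round(peak, 1))
```

seaborn's `kdeplot` wraps the same estimator for the two-dimensional predictor-vs-delay plots shown in the report.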

1.0.3.3 Distribution of Categorical Variables Against departure_delay (Response Variable)

1.0.3.4 Frequency Distribution of Predictors